Viability of Sequence Labeling Encodings for Dependency Parsing
Programa Oficial de Doutoramento en Computación (5009V01)
[Abstract]
This thesis presents new methods for recasting dependency parsing as
a sequence labeling task, yielding a viable alternative to the traditional
transition- and graph-based approaches. It is shown that sequence labeling
parsers provide several advantages for dependency parsing: (i) a good
trade-off between accuracy and parsing speed, (ii) genericity, which enables
running a parser in generic sequence labeling software, and (iii) pluggability,
which allows full parse trees to be used as features in downstream tasks.
The backbone of dependency parsing as sequence labeling is the encodings,
which serve as linearization methods for mapping dependency
trees into discrete labels, such that each token in a sentence is associated
with a label. We introduce three encoding families: (i) head-selection,
(ii) bracketing-based and (iii) transition-based encodings, which differ
in how they represent a dependency tree as a sequence of labels. We
empirically examine the viability of the encodings and provide an analysis
of their facets.
Furthermore, we explore the feasibility of leveraging external complementary
data in order to enhance parsing performance. Our sequence
labeling parser is endowed with two kinds of representations. First,
we exploit the complementary nature of dependency and constituency
parsing paradigms and enrich the parser with representations from both
syntactic abstractions. Second, we use human language processing
data to guide our parser with representations from eye movements.
Overall, the results show that recasting dependency parsing as sequence
labeling is a viable approach that is fast and accurate and provides
a practical alternative for integrating syntax in NLP tasks.

This work has been carried out thanks to the funding from
the European Research Council (ERC), under the European Union’s
Horizon 2020 research and innovation programme (FASTPARSE, grant
agreement No 714150).
Sequence Tagging for Fast Dependency Parsing
[Abstract]
Dependency parsing has been built upon the idea of using parsing methods based on shift-reduce or graph-based algorithms in order to identify binary dependency relations between the words in a sentence. In this study we adopt a radically different approach and cast full dependency parsing as a pure sequence tagging task. In particular, we apply a linearization function to the tree that results in an output label for each token that conveys information about the word's dependency relations. We then follow a supervised strategy and train a bidirectional long short-term memory network to learn to predict such linearized trees. Contrary to previous studies attempting this, the results show that this approach leads to not only accurate but also fast dependency parsing. Furthermore, we obtain even faster and more accurate parsers by recasting the problem as multitask learning, with a twofold objective: to reduce the output vocabulary and to exploit hidden patterns coming from a second parsing paradigm (constituent grammars) used as an auxiliary task.
Ministerio de Economía y Competitividad; TIN2017-85160-C2-1-R. Xunta de Galicia; ED431B 2017/0
Sequence Labeling Parsing by Learning Across Representations
We use parsing as sequence labeling as a common framework to learn across
constituency and dependency syntactic abstractions. To do so, we cast the
problem as multitask learning (MTL). First, we show that adding a parsing
paradigm as an auxiliary loss consistently improves the performance on the
other paradigm. Second, we explore an MTL sequence labeling model that parses
both representations, at almost no cost in terms of performance and speed. The
results across the board show that on average MTL models with auxiliary losses
for constituency parsing outperform single-task ones by 1.14 F1 points, and for
dependency parsing by 0.62 UAS points.
Comment: Proc. of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Revised version after fixing evaluation bug.
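The auxiliary-loss setup above can be sketched as a weighted sum of per-task losses over a shared model: the main paradigm's loss dominates and the other paradigm contributes a down-weighted term. The function name and the 0.1 weight are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch of an MTL objective with an auxiliary parsing paradigm.
# The 0.1 auxiliary weight is an illustrative assumption, not the
# value reported in the paper.

def mtl_loss(main_loss, aux_loss, aux_weight=0.1):
    """Total loss for one batch: main task plus scaled auxiliary task."""
    return main_loss + aux_weight * aux_loss

# e.g. dependency labels as the main task, constituency labels as auxiliary:
total = mtl_loss(main_loss=0.83, aux_loss=1.20)
print(round(total, 2))  # 0.95
```

In this scheme the auxiliary head is discarded at test time, so the improvement on the main paradigm comes at no extra inference cost.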
Parsing as Pretraining
[Abstract] Recent analyses suggest that encoders pretrained for language
modeling capture certain morpho-syntactic structure.
However, probing frameworks for word vectors still do not report
results on standard setups such as constituent and dependency
parsing. This paper addresses this problem and does
full parsing (on English) relying only on pretraining architectures
– and no decoding. We first cast constituent and dependency
parsing as sequence tagging. We then use a single
feed-forward layer to directly map word vectors to labels that
encode a linearized tree. This is used to: (i) see how far we can
reach on syntax modelling with just pretrained encoders, and
(ii) shed some light on the syntax-sensitivity of different
word vectors (by freezing the weights of the pretraining network
during training). For evaluation, we use bracketing F1-score and LAS, and analyze in-depth differences across representations for span lengths and dependency displacements. The overall results surpass existing sequence tagging parsers on the PTB (93.5%) and end-to-end EN-EWT UD (78.8%).
We thank Mark Anderson and Daniel Hershcovich for their comments. DV, MS and CGR are funded by the ERC under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant No 714150), by the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and by Xunta de Galicia (ED431B 2017/01). AS is funded by a Google Focused Research Award.
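The probing setup described above, where frozen pretrained word vectors are mapped to tree-encoding labels by a single feed-forward layer with no decoding, can be sketched as follows. The random vectors stand in for a real frozen encoder's output; shapes, names and label count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen pretrained word vectors (3 tokens, dimension 8).
# In the paper's setup these come from a pretrained encoder whose weights
# are not updated during training; here they are random illustrative values.
word_vectors = rng.normal(size=(3, 8))

# The only trained component: one feed-forward (linear) layer mapping each
# word vector directly to logits over linearized-tree labels.
n_labels = 5
W = rng.normal(size=(8, n_labels))
b = np.zeros(n_labels)

logits = word_vectors @ W + b        # shape (3, n_labels)
predicted = logits.argmax(axis=1)    # one label index per token, no decoding
print(predicted.shape)  # (3,)
```

Because only the linear layer is trained, any parsing accuracy obtained this way is attributable to structure already present in the frozen vectors, which is exactly what the probe is meant to measure.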
"Jeg forstår (ikke) norsken din!". En sosiolingvistisk studie i forståelse av norske dialekter blant polske studenter i Oslo.
This thesis examines the ability of Polish learners of Norwegian to understand and geographically place five Norwegian dialects. The topic is investigated using qualitative and quantitative methods. Ten Polish informants answered a questionnaire and took a dialect test. A control group of five Norwegian informants also took the dialect test, which made it possible to compare the results of the target group with those of the control group. The informants were tested on the Oslo, Bergen, Tromsø, Stavanger and Trondheim dialects.
The analysis of the survey results was carried out in three parts. The first part shows to what extent the Polish and Norwegian informants answered correctly on questions testing general comprehension of the individual dialects. The second part shows to what extent the control and target groups located the five dialects correctly on a map. The third part presents possible factors that may have influenced the results of the Polish informants.
The analysis showed that the Polish informants scored lower on the dialect test than the Norwegian informants. The hypothesis that the Polish informants understand the Oslo dialect better than the other four Norwegian dialects was confirmed. In addition, there was great variation in the Polish informants' results, both for general comprehension and for geographical localization of the Norwegian dialects. Furthermore, the study shows that the Polish informants had little knowledge of dialectal words. They also had trouble recognizing words they did know, which may indicate that they had stored only one phonological representation of a word, and that this representation does not always cover the pronunciation of the same word in another dialect. Simple regression analysis shows that the Polish informants who had lived longest in Norway scored best on average in general comprehension of the selected dialects, but worse on geographical localization. The study suggests that a familiarity effect between Norwegian dialects occurred among the Polish informants. It also suggests that second-language learners could benefit from more instruction in Norwegian dialects, for example more practice in listening to Norwegian dialects and instruction in dialect features, dialectal words and dialect geography in courses for second-language learners of Norwegian.